Implementing Linear Regression

In this blog post, I implement least-squares linear regression, and experiment with LASSO regularization for overparameterized problems.
Author

Sally Liu

Published

March 24, 2023

Objective:

The blog post consists of five parts:

  1. The implementation of Linear Regression in two ways (Analytical Formula and Gradient Descent).

  2. The demonstration and testing of my implementation.

  3. Experiments on how increasing p_features affects training and validation scores.

  4. The implementation and experimentation with LASSO regularization.

  5. The training of an instance of my LinearRegression class on the bikeshare training data.

%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
np.seterr(all='ignore') 

from sklearn.datasets import make_blobs

Part 1. Implement Linear Regression

  • I define two separate methods, fit_analytic and fit_gradient, one for each approach (a rough sketch of the class appears just below).
from source import LinearRegression

LR = LinearRegression()
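The actual implementation lives in source.py, which this post doesn't reproduce. As a rough sketch only (assuming the class stores its weights in w, records a score_history during gradient descent, and uses the coefficient of determination as its score), it might look like this:

import numpy as np

def pad(X):
    return np.append(X, np.ones((X.shape[0], 1)), 1)

class LinearRegression:

    def fit_analytic(self, X, y):
        # closed-form least squares on the padded matrix: w = (X_^T X_)^{-1} X_^T y
        X_ = pad(X)
        self.w = np.linalg.inv(X_.T @ X_) @ X_.T @ y

    def fit_gradient(self, X, y, alpha = 0.01, max_iter = 1e3):
        # precompute X_^T X_ and X_^T y so each step costs O(p^2) rather than O(np)
        X_ = pad(X)
        P, q = X_.T @ X_, X_.T @ y
        self.w = np.random.rand(X_.shape[1])
        self.score_history = []
        for _ in range(int(max_iter)):
            self.w -= alpha * 2 * (P @ self.w - q) / X_.shape[0]  # gradient of the MSE
            self.score_history.append(self.score(X, y))

    def predict(self, X_):
        # expects a feature matrix that has already been padded
        return X_ @ self.w

    def score(self, X, y):
        # coefficient of determination: 1 - SSE / SST, so a perfect fit scores 1.0
        y_hat = pad(X) @ self.w
        return 1 - ((y - y_hat)**2).sum() / ((y - y.mean())**2).sum()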

Part 2. Demo

I use the following function to create both training and validation data, which can be used to test my implementation:

import numpy as np
from matplotlib import pyplot as plt

def pad(X):
    return np.append(X, np.ones((X.shape[0], 1)), 1)

def LR_data(n_train = 100, n_val = 100, p_features = 1, noise = .1, w = None):
    if w is None: 
        w = np.random.rand(p_features + 1) + .2
    
    X_train = np.random.rand(n_train, p_features)
    y_train = pad(X_train)@w + noise*np.random.randn(n_train)

    X_val = np.random.rand(n_val, p_features)
    y_val = pad(X_val)@w + noise*np.random.randn(n_val)
    
    return X_train, y_train, X_val, y_val

Use the LR_data function to generate some data:

n_train = 100
n_val = 100
p_features = 1
noise = 0.2

# create some data
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)

# plot it
fig, axarr = plt.subplots(1, 2, sharex = True, sharey = True)
axarr[0].scatter(X_train, y_train)
axarr[1].scatter(X_val, y_val)
labs = axarr[0].set(title = "Training", xlabel = "x", ylabel = "y")
labs = axarr[1].set(title = "Validation", xlabel = "x")
plt.tight_layout()

Implement my two solution methods for least-squares linear regression

(1) Analytical Method:

LR.fit_analytic(X_train, y_train) # analytical formula 

print(f"Training score = {LR.score(X_train, y_train).round(4)}")
print(f"Validation score = {LR.score(X_val, y_val).round(4)}")
Training score = 0.5817
Validation score = 0.4624

The estimated weight vector w is:

LR.w
array([0.81730721, 0.59082072])
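As a quick sanity check (not part of the original class), numpy's built-in least-squares solver should recover essentially the same weights:

# cross-check the closed-form fit against numpy's least-squares solver
w_check, *_ = np.linalg.lstsq(pad(X_train), y_train, rcond = None)
print(w_check)  # should closely match LR.w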

(2) Gradient Descent method:

LR2 = LinearRegression()

LR2.fit_gradient(X_train, y_train, alpha = 0.01, max_iter = 1e2)
LR2.w
array([0.81678343, 0.5911116 ])

I get a very similar weight vector with the gradient descent method.

Now I can see how the score changed over the course of training:

plt.plot(LR2.score_history)
labels = plt.gca().set(xlabel = "Iteration", ylabel = "Score")

Based on the plot, the score increases monotonically across iterations, as expected for gradient descent with a sufficiently small learning rate.

Part 3. Experiments

Here, I perform an experiment in which I allow p_features, the number of features used, to increase, while holding n_train, the number of training points, constant.

  • I increase p_features from 1 all the way to n_train - 1 and observe the change in training score and validation score.
from source import LinearRegression
from matplotlib import pyplot as plt

LR = LinearRegression()

n_train = 100
n_val = 100
p_features = 1
noise = 0.2

done = False

train_score_history = []
val_score_history = []
p_features_history = []

while not done:
    # create some data
    X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
    LR.fit_analytic(X_train, y_train)
    
    p_features_history.append(p_features)
    train_score_history.append(LR.score(X_train, y_train))
    val_score_history.append(LR.score(X_val, y_val))

    p_features = p_features+1

    if p_features == n_train:
        done = True

plt.plot(p_features_history,train_score_history, label = "Training score")
plt.plot(p_features_history,val_score_history, label = "Validation score")
plt.xlabel("p features")
plt.ylabel("scores")
legend = plt.legend() 
plt.show()

print("Training score :", train_score_history[-5:])
print("Validation score :",val_score_history[-5:])

Training score : [0.9994080138357238, 0.9999632920224747, 0.9997989179387414, 0.9999287317250287, 1.0]
Validation score : [-0.5832790674954091, 0.8902637288952078, 0.6393265453474528, 0.27865576563323635, 0.7239841429657712]
Observations:

According to the plot:

  1. When n_train is much larger than p_features, the linear regression model performs well on the training data, and overfitting does not occur.

  2. When n_train is close to / not much larger than p_features, the training score keeps increasing and reaches 1.0 at p_features = n_train - 1: the padded design matrix is then square and (generically) invertible, so the model interpolates the training data exactly. However, the validation score becomes unstable and can get arbitrarily negative, meaning that the model overfits the data.

Therefore, if we have too many features, the linear regression model may fit the training dataset very well, but it may overfit and fail to predict new data correctly.

Part 4. LASSO Regularization

I use the LASSO implementation from sklearn's linear_model module:

  • alpha controls the strength of the regularization
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
L = Lasso(alpha = 0.001)
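For reference, scikit-learn's Lasso minimizes the penalized least-squares objective

$$\frac{1}{2\,n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1,$$

so a larger alpha pushes more coefficients to exactly zero.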

Fit this model on some data and check the training and validation scores:

p_features = n_train - 1
X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
L.fit(X_train, y_train)
Lasso(alpha=0.001)
print("Training score:", L.score(X_train, y_train))
print("Validation score:", L.score(X_val, y_val))
Training score: 0.9991166922705192
Validation score: 0.816469412637295
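One selling point of the L1 penalty is sparsity. As a quick check (not part of the original experiment), I can count how many coefficients LASSO drove exactly to zero:

# count the coefficients that the L1 penalty set exactly to zero
print("Zeroed coefficients:", np.sum(L.coef_ == 0), "out of", L.coef_.size)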

Now I replicate the same experiment I did with standard linear regression, increasing the number of features up to or even past n_train - 1, using LASSO instead of linear regression:

n_train = 100
n_val = 100
p_features = 1
noise = 0.2

done = False

train_score_history = []
val_score_history = []
p_features_history = []

while not done:
    # create some data
    X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
    L.fit(X_train, y_train)

    p_features_history.append(p_features)
    train_score_history.append(L.score(X_train, y_train))
    val_score_history.append(L.score(X_val, y_val))

    p_features = p_features+1

    if p_features == n_train+10:
        done = True

plt.plot(p_features_history,train_score_history,color = 'orange', label = "Training score")
plt.plot(p_features_history,val_score_history, color = 'purple',label = "Validation score")
plt.xlabel("p features")
plt.ylabel("scores")
legend = plt.legend() 
plt.show()

print("Training score :", train_score_history[-5:])
print("Validation score :",val_score_history[-5:])
/Users/sallyliu/opt/anaconda3/envs/ml-0451/lib/python3.9/site-packages/sklearn/linear_model/_coordinate_descent.py:631: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.504e-02, tolerance: 5.884e-02
  model = cd_fast.enet_coordinate_descent(

Training score : [0.9989300253675973, 0.99792540803609, 0.998569999697423, 0.9989241311754407, 0.9988367674488902]
Validation score : [0.6506501450219913, 0.7050627630831812, 0.7656971292037806, 0.6022547034988706, 0.6514233032991358]
Observations:

Based on the plot:

  1. When n_train is much larger than p_features, the LASSO model performs well on the training data, the scores are relatively stable, and overfitting does not occur.

  2. When n_train is close to / not much larger than p_features, the training score increases steadily towards 1.0. The validation score becomes noisier towards the end, but it does not get arbitrarily negative; it stays far better controlled than in the unregularized experiment. So LASSO regularization helps prevent the model from overfitting.

Next, I experiment with a few values of the regularization strength alpha, comparing the validation scores of standard linear regression and LASSO when the number of features is large.

  1. Set alpha = 0.01
L = Lasso(alpha = 0.01)
LR = LinearRegression()
n_train = 100
n_val = 100
p_features = 20
noise = 0.2

done = False

val_score_std = []
val_score_lasso = []
p_features_history = []

while not done:
    # create some data
    X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
    LR.fit_analytic(X_train, y_train)
    L.fit(X_train, y_train)
    p_features_history.append(p_features)
    
    val_score_std.append(LR.score(X_val, y_val))
    val_score_lasso.append(L.score(X_val, y_val))

    p_features = p_features+1

    if p_features == n_train:
        done = True

plt.plot(p_features_history,val_score_std, color = 'blue', label = "std")
plt.plot(p_features_history,val_score_lasso, color = 'green',label = "lasso")
plt.xlabel("p features")
plt.ylabel("validation scores")
legend = plt.legend() 
plt.show()

print("Validation score for standard linear regression:", val_score_std[-5:])
print("Validation score for LASSO:",val_score_lasso[-5:])

Validation score for standard linear regression: [0.9245631621914553, -1.3386640736853632, 0.7534738161505369, 0.5483048951431753, 0.7208474603034747]
Validation score for LASSO: [0.5883800034017881, 0.6368464509680394, 0.5184516531017589, 0.6061285503492282, 0.6079333676050096]
  2. Set alpha = 0.05
L = Lasso(alpha = 0.05)
LR = LinearRegression()
n_train = 100
n_val = 100
p_features = 20
noise = 0.2

done = False

val_score_std = []
val_score_lasso = []
p_features_history = []

while not done:
    # create some data
    X_train, y_train, X_val, y_val = LR_data(n_train, n_val, p_features, noise)
    LR.fit_analytic(X_train, y_train)
    L.fit(X_train, y_train)
    p_features_history.append(p_features)
    
    val_score_std.append(LR.score(X_val, y_val))
    val_score_lasso.append(L.score(X_val, y_val))

    p_features = p_features+1

    if p_features == n_train:
        done = True

plt.plot(p_features_history,val_score_std, color = 'blue',label = "std")
plt.plot(p_features_history,val_score_lasso, color = 'green',label = "lasso")
plt.xlabel("p features")
plt.ylabel("validation scores")
legend = plt.legend() 
plt.show()

print("Validation score for standard linear regression:", val_score_std[-5:])
print("Validation score for LASSO:",val_score_lasso[-5:])

Validation score for standard linear regression: [0.7549815717405675, 0.7033608157667165, -5.667886382553809, -0.3100673675265291, 0.24877903484777042]
Validation score for LASSO: [0.203027700962511, 0.33477054239299164, 0.16070213355583274, 0.19505021823469149, 0.2370499617158901]
Observations

According to the plots of validation scores for standard linear regression and LASSO regularization, LASSO tends to have a more stable validation score when the number of features is large. Even though its score is generally lower than that of standard linear regression, it fluctuates far less and does not get arbitrarily negative.
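Rather than hand-picking alpha, the LassoCV class imported earlier can select it by cross-validation. A hypothetical sketch (not run in this post; the alpha grid and fold count are my own choices):

# let 5-fold cross-validation pick alpha from a log-spaced grid
LCV = LassoCV(alphas = np.logspace(-4, 0, 50), cv = 5)
LCV.fit(X_train, y_train)
print("Selected alpha:", LCV.alpha_)
print("Validation score:", LCV.score(X_val, y_val))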

Part 5. Optional: Bikeshare Data Set

Download and save a data frame of data related to the Capital Bikeshare system in Washington DC:

  • This data set includes information about the season and time of year; the weather; and the count of bicycle users on each day for two years (year 0 is 2011, year 1 is 2012). This level of information gives us considerable ability to model phenomena in the data.
import pandas as pd
from sklearn.model_selection import train_test_split
bikeshare = pd.read_csv("https://philchodrow.github.io/PIC16A/datasets/Bike-Sharing-Dataset/day.csv")
bikeshare_raw = bikeshare
bikeshare.head()
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 6 0 2 0.344167 0.363625 0.805833 0.160446 331 654 985
1 2 2011-01-02 1 0 1 0 0 0 2 0.363478 0.353739 0.696087 0.248539 131 670 801
2 3 2011-01-03 1 0 1 0 1 1 1 0.196364 0.189405 0.437273 0.248309 120 1229 1349
3 4 2011-01-04 1 0 1 0 2 1 1 0.200000 0.212122 0.590435 0.160296 108 1454 1562
4 5 2011-01-05 1 0 1 0 3 1 1 0.226957 0.229270 0.436957 0.186900 82 1518 1600

Our aim for this case study is to model daily usage by casual users (as opposed to registered users). The total number of casual users each day is given by the casual column. Let's plot this over time:

fig, ax = plt.subplots(1, figsize = (7, 3))
ax.plot(pd.to_datetime(bikeshare['dteday']), bikeshare['casual'])
ax.set(xlabel = "Day", ylabel = "# of casual users")
l = plt.tight_layout()

I work with a smaller subset of the columns and transform the mnth column into dummy variables:

cols = ["casual", 
        "mnth", 
        "weathersit", 
        "workingday",
        "yr",
        "temp", 
        "hum", 
        "windspeed",
        "holiday"]

bikeshare = bikeshare[cols]

bikeshare = pd.get_dummies(bikeshare, columns = ['mnth'], drop_first = True) # drop mnth_1 as the baseline
bikeshare
casual weathersit workingday yr temp hum windspeed holiday mnth_2 mnth_3 mnth_4 mnth_5 mnth_6 mnth_7 mnth_8 mnth_9 mnth_10 mnth_11 mnth_12
0 331 2 0 0 0.344167 0.805833 0.160446 0 0 0 0 0 0 0 0 0 0 0 0
1 131 2 0 0 0.363478 0.696087 0.248539 0 0 0 0 0 0 0 0 0 0 0 0
2 120 1 1 0 0.196364 0.437273 0.248309 0 0 0 0 0 0 0 0 0 0 0 0
3 108 1 1 0 0.200000 0.590435 0.160296 0 0 0 0 0 0 0 0 0 0 0 0
4 82 1 1 0 0.226957 0.436957 0.186900 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
726 247 2 1 1 0.254167 0.652917 0.350133 0 0 0 0 0 0 0 0 0 0 0 1
727 644 2 1 1 0.253333 0.590000 0.155471 0 0 0 0 0 0 0 0 0 0 0 1
728 159 2 0 1 0.253333 0.752917 0.124383 0 0 0 0 0 0 0 0 0 0 0 1
729 364 1 0 1 0.255833 0.483333 0.350754 0 0 0 0 0 0 0 0 0 0 0 1
730 439 2 1 1 0.215833 0.577500 0.154846 0 0 0 0 0 0 0 0 0 0 0 1

731 rows × 19 columns

Do the train-test split:

train, test = train_test_split(bikeshare, test_size = .2, shuffle = False)

X_train = train.drop(["casual"], axis = 1)
y_train = train["casual"]

X_test = test.drop(["casual"], axis = 1)
y_test = test["casual"]

X_train.head()
weathersit workingday yr temp hum windspeed holiday mnth_2 mnth_3 mnth_4 mnth_5 mnth_6 mnth_7 mnth_8 mnth_9 mnth_10 mnth_11 mnth_12
0 2 0 0 0.344167 0.805833 0.160446 0 0 0 0 0 0 0 0 0 0 0 0
1 2 0 0 0.363478 0.696087 0.248539 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0.196364 0.437273 0.248309 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 0 0.200000 0.590435 0.160296 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 0 0.226957 0.436957 0.186900 0 0 0 0 0 0 0 0 0 0 0 0

Train the analytical method of my LinearRegression class on the bikeshare training data.

from source import LinearRegression

LR = LinearRegression()
LR.fit_analytic(X_train, y_train)
  1. Score my linear regression model on the test set:
LR.score(X_test,y_test)
0.6967732383931238
  2. Compute the predictions for each day and visualize them in comparison to the actual ridership on the test set:
import numpy as np
import matplotlib.pyplot as plt

def pad(X):
    return np.append(X, np.ones((X.shape[0], 1)), 1)

X_test_ = pad(X_test)
y_predict = LR.predict(X_test_)

ax1 = plt.subplot(211)
ax1.plot(y_predict, color = "blue", label = "predicted")
ax1.legend()
ax2 = plt.subplot(212)
ax2.plot(y_test, color = "green", label = "actual")
ax2.legend()

plt.show()
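The stacked panels make magnitudes a little hard to compare. As an optional variation (using the bikeshare_raw frame saved earlier; the dates line up because the split was not shuffled), both series can be overlaid on one date axis:

# overlay predictions and actuals on a shared date axis for easier comparison
dates = pd.to_datetime(bikeshare_raw['dteday']).loc[y_test.index]
fig, ax = plt.subplots(1, figsize = (7, 3))
ax.plot(dates, y_test, color = "green", label = "actual")
ax.plot(dates, y_predict, color = "blue", label = "predicted")
ax.set(xlabel = "Day", ylabel = "# of casual users")
ax.legend()
plt.tight_layout()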

  3. Compare the entries of my model's weight vector w to the corresponding entries of X_train.columns, in order to see which features my model found to contribute to ridership.
  • A positive coefficient suggests that the corresponding feature contributes positively to ridership.
LR.w
array([ -108.37113627,  -791.69054913,   280.58692733,  1498.71511272,
        -490.10033978, -1242.80038075,  -235.87934918,    -3.35439712,
         369.27195552,   518.40875345,   537.30188616,   360.80799815,
         228.88148125,   241.31641202,   371.50385387,   437.60084787,
         252.43300405,    90.8214605 ,   919.07676215])
X_train.columns
Index(['weathersit', 'workingday', 'yr', 'temp', 'hum', 'windspeed', 'holiday',
       'mnth_2', 'mnth_3', 'mnth_4', 'mnth_5', 'mnth_6', 'mnth_7', 'mnth_8',
       'mnth_9', 'mnth_10', 'mnth_11', 'mnth_12'],
      dtype='object')
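To read these side by side, note that w has one more entry than X_train.columns: the final entry is the intercept added by pad. Pairing each remaining coefficient with its column name:

# pair each coefficient with its feature name; the last entry of w is the intercept
coef_table = pd.DataFrame({"feature": X_train.columns, "coefficient": LR.w[:-1]})
print(coef_table.sort_values("coefficient", ascending = False))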
Conclusion:

‘yr’, ‘temp’, and nearly all of the “mnth” indicators have positive coefficients, suggesting that they contribute positively to ridership (the one exception is ‘mnth_2’, whose coefficient is slightly negative and essentially zero). Among the features, ‘temp’ has by far the largest coefficient, so warm weather appears to have the most significant positive effect on casual ridership. Note that the final entry of w, about 919, is the intercept introduced by padding rather than a feature coefficient.